{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google+ Demo\n",
    "This notebook contains a detailed example, demonstrating the typical workflow Graft aims to support. The dataset used here, [Google+](http://snap.stanford.edu/data/egonets-Gplus.html), was obtained from the Stanford Network Analysis Platform,\n",
    "Reference : `J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.`\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DataSet summary:\n",
    "* `vertices` : 107614\n",
    "* `edges` : 51127\n",
    "\n",
    "The Google+ data is essentially a network of professionals across the world. Each vertex or person,\n",
    "has the following attributes attached:\n",
    "\n",
    "1. `gender` : enum\n",
    "2. `institute` : An array containing keywords describing the person's workplace\n",
    "3. `job_title` : An array containing keywords describing the person's role\n",
    "4. `last_name`\n",
    "5. `place` : An array containing places the person has worked/lived\n",
    "6. `university` : An array containing keywords describing the universities a person has attended\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing\n",
    "The dataset is in the form of a ego-network, and contains a set of files for each ego-node:\n",
    "1. `nodeId.edges` : The edges in the ego network for the node 'nodeId'. The 'ego' node does not appear, but it is assumed that they\n",
    "follow every node id that appears in this file.\n",
    "2. `nodeId.feat` : The features for each of the nodes that appears in the edge file.\n",
    "3. `nodeId.egofeat` : The features for the ego user.\n",
    "4. `nodeId.featnames` : The names of each of the feature dimensions. Features are '1' if the user has this property in their profile, and '0' otherwise.\n",
    "\n",
    "The structure of the vertex metadata is quite awkward, but nothing a bit of preprocessing can't handle:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Progress:  95%  ETA: 0:00:18"
     ]
    }
   ],
   "source": [
    "using Graft\n",
    "using StatsBase\n",
    "import LightGraphs\n",
    "\n",
    "# Fetch the dataset\n",
    "# Uncompress the vertex metadata and convert to TSV\n",
    "# Write the vertex metadata to vertex_data.txt\n",
    "# Initialize the graph file, Graph.txt, with a header\n",
    "include(joinpath(Pkg.dir(\"Graft\", \"examples/build_dataset.jl\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    ";awk '!seen[$1]++' vertex_data.txt > vdata.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    ";awk '!seen[$0]++' gplus_combined.txt | tr ' ' '\\t' > edata.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    ";cat vdata.txt edata.txt >> Graph.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13781072"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The graph dataset is now stored in Graph.txt\n",
    "countlines(\"Graph.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the Graph into memory\n",
    "Graft provides the `loadgraph` method to extract graphs from files, but it supports only a form of TSV at the moment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fetching Graph Header\n",
      "Loading Vertex Data\n",
      "Progress: 100% Time: 0:00:43\n",
      "Loading Edge Data\n",
      "Progress:  98%  ETA: 0:00:01"
     ]
    }
   ],
   "source": [
    "g = loadgraph(\"Graph.txt\"; verbose=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(107614,13673453)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the graph's size\n",
    "size(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Graph Queries\n",
    "Now that the graph is loaded into memory, we can start mining interesting information from the graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "top5 (generic function with 1 method)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Function to fetch the 5 most frequent entries\n",
    "top5(x) = sort(collect(countmap(vcat(filter(y->length(y) > 0, collect(x))...))), by=x->x[2], rev=true)[1 : 5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5-element Array{Pair{String,Int64},1}:\n",
       " \"Stanford University\"=>452                  \n",
       " \"Polytechnic University of Puerto Rico\"=>105\n",
       " \"East Carolina University\"=>91              \n",
       " \"University of Utah\"=>86                    \n",
       " \"Colorado State University\"=>84             "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find the universities where alumni are well connected\n",
    "@query(g |> filter(s.university == t.university) |> eachedge(s.university)) |> top5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5-element Array{Pair{String,Int64},1}:\n",
       " \"Stanford University\"=>182               \n",
       " \"University of California, Berkeley\"=>103\n",
       " \"University of Phoenix\"=>87              \n",
       " \"University of Michigan\"=>82             \n",
       " \"Harvard University\"=>75                 "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# If you work for Google, which schools did people in your network go to?\n",
    "network = hopgraph(g, @query(g |> filter(\"Google\" in v.institution) |> eachvertex(v.label)), 1)\n",
    "@query(network |> eachvertex(v.university)) |> top5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5-element Array{Pair{String,Int64},1}:\n",
       " \"University of Southern California\"=>13    \n",
       " \"University of California, Los Angeles\"=>12\n",
       " \"University of California, Berkeley\"=>8    \n",
       " \"Columbia University\"=>7                   \n",
       " \"New York University\"=>7                   "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find the most popular schools in Los Angeles\n",
    "@query(g |> filter(\"Los Angeles\" in v.place) |> eachvertex(v.university)) |> top5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5-element Array{Pair{String,Int64},1}:\n",
       " \"London\"=>2105           \n",
       " \"New York\"=>1577         \n",
       " \"San Francisco, CA\"=>1381\n",
       " \"Chicago, IL\"=>1163      \n",
       " \"San Francisco\"=>1131    "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find cities that are well connected to New York\n",
    "@query(g |> filter(\"New York\" in s.place) |> eachedge(t.place)) |> top5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Run page rank, using LightGraphs, and set the result as a vertex property\n",
    "M = export_adjacency(g)\n",
    "setvprop!(g, :, LightGraphs.pagerank(LightGraphs.DiGraph(M)), :pagerank);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "│ VertexID │ Labels                │ gender │ last_name       │ pagerank   │\n",
       "├──────────┼───────────────────────┼────────┼─────────────────┼────────────┤\n",
       "│ 1        │ 114985346359714431656 │ 1      │ \"non-ascii_620\" │ 6.48789e-6 │\n",
       "│ 2        │ 111065108889012087599 │ 1      │ \"dan\"           │ 7.756e-5   │\n",
       "│ 3        │ 113204882497955654314 │ 1      │ NA              │ 6.3443e-6  │\n",
       "│ 4        │ 116860750964767060846 │ 1      │ \"steven\"        │ 1.08936e-5 │\n",
       "│ 5        │ 109870053628419941069 │ 0      │ NA              │ 4.64961e-5 │\n",
       "│ 6        │ 108249232416813189685 │ 1      │ \"jonathan\"      │ 3.61508e-5 │\n",
       "│ 7        │ 111193388731102401849 │ 1      │ \"scott\"         │ 4.4071e-6  │\n",
       "│ 8        │ 111737859526639530840 │ 2      │ NA              │ 8.02871e-6 │\n",
       "│ 9        │ 112299972688047529628 │ 1      │ \"todd\"          │ 4.42207e-6 │\n",
       "│ 10       │ 102945758979783986480 │ 1      │ \"adam\"          │ 4.87197e-5 │\n",
       "│ 11       │ 100710846514306801296 │ 1      │ \"daniel\"        │ 2.36998e-5 │\n",
       "⋮\n",
       "│ 107603   │ 106569813518583803263 │ 1      │ NA              │ 3.88311e-6 │\n",
       "│ 107604   │ 108762954268468707753 │ 1      │ NA              │ 1.00141e-5 │\n",
       "│ 107605   │ 115866686692802419945 │ 1      │ NA              │ 5.79192e-6 │\n",
       "│ 107606   │ 110854930685324287139 │ 1      │ NA              │ 6.4456e-6  │\n",
       "│ 107607   │ 116320812345106679256 │ 2      │ NA              │ 5.4473e-6  │\n",
       "│ 107608   │ 114679782050104286479 │ 1      │ NA              │ 6.20278e-6 │\n",
       "│ 107609   │ 109172913211802298398 │ 1      │ \"chris\"         │ 3.88311e-6 │\n",
       "│ 107610   │ 109553288557762375881 │ 1      │ NA              │ 4.83773e-6 │\n",
       "│ 107611   │ 102953973878117962383 │ 1      │ \"adam\"          │ 8.05326e-6 │\n",
       "│ 107612   │ 116863879615717543881 │ 1      │ \"wolfgang\"      │ 5.96955e-6 │\n",
       "│ 107613   │ 102977664233671324384 │ 1      │ NA              │ 3.88311e-6 │\n",
       "│ 107614   │ 109995768273040439562 │ 1      │ NA              │ 1.5836e-5  │"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print out the vertex descriptor with a few properties\n",
    "VertexDescriptor(@query(g |> select(v.gender, v.last_name, v.pagerank)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "│ Index    │ Source                │ Target                │ mutual_friends │\n",
       "├──────────┼───────────────────────┼───────────────────────┼────────────────┤\n",
       "│ 1        │ 114985346359714431656 │ 107211094246725549672 │ 38             │\n",
       "│ 2        │ 114985346359714431656 │ 106467891941793155295 │ 2              │\n",
       "│ 3        │ 114985346359714431656 │ 100043057758270223301 │ 84             │\n",
       "│ 4        │ 114985346359714431656 │ 107831413340697273273 │ 4              │\n",
       "│ 5        │ 114985346359714431656 │ 114839638425953508537 │ 68             │\n",
       "│ 6        │ 114985346359714431656 │ 113114462378360775452 │ 71             │\n",
       "│ 7        │ 114985346359714431656 │ 116407635616074189669 │ 30             │\n",
       "│ 8        │ 114985346359714431656 │ 104428814384443083380 │ 50             │\n",
       "│ 9        │ 114985346359714431656 │ 111159036121102171686 │ 20             │\n",
       "│ 10       │ 114985346359714431656 │ 106898588952511738977 │ 71             │\n",
       "│ 11       │ 114985346359714431656 │ 105228342880444036996 │ 63             │\n",
       "⋮\n",
       "│ 13673442 │ 109995768273040439562 │ 111754564023086319115 │ 51             │\n",
       "│ 13673443 │ 109995768273040439562 │ 112198806072857130680 │ 61             │\n",
       "│ 13673444 │ 109995768273040439562 │ 110060537346086731221 │ 21             │\n",
       "│ 13673445 │ 109995768273040439562 │ 107841579136101149046 │ 8              │\n",
       "│ 13673446 │ 109995768273040439562 │ 114294515409520683622 │ 11             │\n",
       "│ 13673447 │ 109995768273040439562 │ 117362202493439608589 │ 89             │\n",
       "│ 13673448 │ 109995768273040439562 │ 118332792877295434028 │ 71             │\n",
       "│ 13673449 │ 109995768273040439562 │ 101941479331974800200 │ 66             │\n",
       "│ 13673450 │ 109995768273040439562 │ 106815259454822754885 │ 66             │\n",
       "│ 13673451 │ 109995768273040439562 │ 105657594395895234072 │ 71             │\n",
       "│ 13673452 │ 109995768273040439562 │ 102367310168069130255 │ 38             │\n",
       "│ 13673453 │ 109995768273040439562 │ 102953973878117962383 │ 38             │"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find the number of mutual friends between the source and target vertices for each edge\n",
    "seteprop!(g, :, @query(g |> eachedge(e.mutualcount)), :mutual_friends);\n",
    "EdgeDescriptor(g)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.5.0-rc2",
   "language": "julia",
   "name": "julia-0.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
